{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"center\"><h1>Accelerating Applications with CUDA C/C++</h1></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![CUDA](./images/CUDA_Logo.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accelerated computing is replacing CPU-only computing as best practice. The litany of breakthroughs driven by accelerated computing, the ever increasing demand for accelerated applications, programming conventions that ease writing them, and constant improvements in the hardware that supports them, are driving this inevitable transition.\n",
    "\n",
    "At the center of accelerated computing's success, both in terms of its impressive performance, and its ease of use, is the [CUDA](https://developer.nvidia.com/about-cuda) compute platform. CUDA provides a coding paradigm that extends languages like C, C++, Python, and Fortran, to be capable of running accelerated, massively parallelized code on the world's most performant parallel processors: NVIDIA GPUs. CUDA accelerates applications drastically with little effort, has an ecosystem of highly optimized libraries for [DNN](https://developer.nvidia.com/cudnn), [BLAS](https://developer.nvidia.com/cublas), [graph analytics](https://developer.nvidia.com/nvgraph), [FFT](https://developer.nvidia.com/cufft), and more, and also ships with powerful [command line](http://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview) and [visual profilers](http://docs.nvidia.com/cuda/profiler-users-guide/index.html#visual).\n",
    "\n",
    "CUDA supports many, if not most, of the [world's most performant applications](https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/catalog/?product_category_id=58,59,60,293,98,172,223,227,228,265,487,488,114,389,220,258,461&search=) in, [Computational Fluid Dynamics](https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/catalog/?product_category_id=10,12,16,17,19,51,53,71,87,121,124,156,157,195,202,203,204,312,339,340,395,407,448,485,517,528,529,541,245,216,104,462,513,250,492,420,429,490,10,12,16,17,19,51,53,71,87,121,124,156,157,195,202,203,204,312,339,340,395,407,448,485,517,528,529,541,245,216,104,462,513,250,492,420,429,490,10,12,16,17,19,51,53,71,87,121,124,156,157,195,202,203,204,312,339,340,395,407,448,485,517,528,529,541,245,216,104,462,513,250,492,420,429,490&search=), [Molecular Dynamics](https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/catalog/?product_category_id=8,57,92,123,211,213,237,272,274,282,283,307,325,337,344,345,351,362,365,380,396,398,400,435,507,508,519,8,57,92,123,211,213,237,272,274,282,283,307,325,337,344,345,351,362,365,380,396,398,400,435,507,508,519,8,57,92,123,211,213,237,272,274,282,283,307,325,337,344,345,351,362,365,380,396,398,400,435,507,508,519,8,57,92,123,211,213,237,272,274,282,283,307,325,337,344,345,351,362,365,380,396,398,400,435,507,508,519&search=), [Quantum Chemistry](https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/catalog/?product_category_id=8,57,92,123,211,213,237,272,274,282,283,307,325,337,344,345,351,362,365,380,396,398,400,435,507,508,519,8,57,92,123,211,213,237,272,274,282,283,307,325,337,344,345,351,362,365,380,396,398,400,435,507,508,519&search=), [Physics](https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/catalog/?product_category_id=6,24,116,118,119,135,229,231,372,373,392,393,489,493,494,495,496,497,498,67,170,216,281,6,24,116,118,119,135,229,231,372,373,392,393,489,493,494,495,496,497,498,67,170,216,281,6,24,116,118,119,135,229,231,372,373,392,393,489,493,494,495,496,497,498,67,170,216,281,6,24,116,118,119,135,229,231,372,373,392,393,489,493,494,495,496,497,498,67,170,216,281,6,24,116,118,119,135,229,231,372,373,392,393,489,493,494,495,496,497,498,67,170,216,281&search=) and HPC.\n",
    "\n",
    "Learning CUDA will enable you to accelerate your own applications. Accelerated applications perform much faster than their CPU-only couterparts, and make possible computations that would be otherwise prohibited given the limited performance of CPU-only applications. In this lab you will receive an introduction to programming accelerated applications with CUDA C/C++, enough to be able to begin work accelerating your own CPU-only applications for performance gains, and for moving into novel computational territory."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Prerequisites\n",
    "\n",
    "To get the most out of this lab you should already be able to:\n",
    "\n",
    "- Declare variables, write loops, and use if / else statements in C.\n",
    "- Define and invoke functions in C.\n",
    "- Allocate arrays in C.\n",
    "\n",
    "No previous CUDA knowledge is required."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Objectives\n",
    "\n",
    "By the time you complete this lab, you will be able to:\n",
    "\n",
    "- Write, compile, and run C/C++ programs that both call CPU functions and **launch** GPU **kernels**.\n",
    "- Control parallel **thread hierarchy** using **execution configuration**.\n",
    "- Refactor serial loops to execute their iterations in parallel on a GPU.\n",
    "- Allocate and free memory available to both CPUs and GPUs.\n",
    "- Handle errors generated by CUDA code.\n",
    "- Accelerate CPU-only applications."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Accelerated Systems\n",
    "\n",
    "*Accelerated systems*, also referred to as *heterogeneous systems*, are those composed of both CPUs and GPUs. Accelerated systems run CPU programs which in turn, launch functions that will benefit from the massive parallelism provided by GPUs. This lab environment is an accelerated system which includes an NVIDIA GPU. Information about this GPU can be queried with the `nvidia-smi` (*Systems Management Interface*) command line command. Issue the `nvidia-smi` command now, by `CTRL` + `ENTER` on the code execution cell below. You will find these cells throughout this lab any time you need to execute code. The output from running the command will be printed just below the code execution cell after the code runs. After running the code execution block immediately below, take care to find and note the name of the GPU in the output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!nvidia-smi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## GPU-accelerated Vs. CPU-only Applications\n",
    "\n",
    "The following slides present upcoming material visually, at a high level. Click through the slides before moving on to more detailed coverage of their topics in following sections.\n",
    "\n",
    "<script>console.log('hi');</script>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<div align=\"center\"><iframe src=\"https://docs.google.com/presentation/d/e/2PACX-1vTdUbQjoEYAtcPCMX4ZVLa9gE0rbO3ODClJqgtzRaXoFgVmTJrOkXGDAYmA0BsuTEaokTASv84JsKLm/embed?start=false&loop=false&delayms=3000\" frameborder=\"0\" width=\"900\" height=\"550\" allowfullscreen=\"true\" mozallowfullscreen=\"true\" webkitallowfullscreen=\"true\"></iframe></div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%HTML\n",
    "\n",
    "<div align=\"center\"><iframe src=\"https://docs.google.com/presentation/d/e/2PACX-1vTdUbQjoEYAtcPCMX4ZVLa9gE0rbO3ODClJqgtzRaXoFgVmTJrOkXGDAYmA0BsuTEaokTASv84JsKLm/embed?start=false&loop=false&delayms=3000\" frameborder=\"0\" width=\"900\" height=\"550\" allowfullscreen=\"true\" mozallowfullscreen=\"true\" webkitallowfullscreen=\"true\"></iframe></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Writing Application Code for the GPU\n",
    "\n",
    "CUDA provides extensions for many common programming languages, in the case of this lab, C/C++. These language extensions easily allow developers to run functions in their source code on a GPU.\n",
    "\n",
    "Below is a `.cu` file (`.cu` is the file extension for CUDA-accelerated programs). It contains two functions, the first which will run on the CPU, the second which will run on the GPU. Spend a little time identifying the differences between the functions, both in terms of how they are defined, and how they are invoked.\n",
    "\n",
    "```cpp\n",
    "void CPUFunction()\n",
    "{\n",
    "  printf(\"This function is defined to run on the CPU.\\n\");\n",
    "}\n",
    "\n",
    "__global__ void GPUFunction()\n",
    "{\n",
    "  printf(\"This function is defined to run on the GPU.\\n\");\n",
    "}\n",
    "\n",
    "int main()\n",
    "{\n",
    "  CPUFunction();\n",
    "\n",
    "  GPUFunction<<<1, 1>>>();\n",
    "  cudaDeviceSynchronize();\n",
    "}\n",
    "```\n",
    "\n",
    "Here are some important lines of code to highlight, as well as some other common terms used in accelerated computing:\n",
    "\n",
    "`__global__ void GPUFunction()`\n",
    "  - The `__global__` keyword indicates that the following function will run on the GPU, and can be invoked **globally**, which in this context means either by the CPU, or, by the GPU.\n",
    "  - Often, code executed on the CPU is referred to as **host** code, and code running on the GPU is referred to as **device** code.\n",
    "  - Notice the return type `void`. It is required that functions defined with the `__global__` keyword return type `void`.\n",
    "\n",
    "`GPUFunction<<<1, 1>>>();`\n",
    "  - Typically, when calling a function to run on the GPU, we call this function a **kernel**, which is **launched**.\n",
    "  - When launching a kernel, we must provide an **execution configuration**, which is done by using the `<<< ... >>>` syntax just prior to passing the kernel any expected arguments.\n",
    "  - At a high level, execution configuration allows programmers to specify the **thread hierarchy** for a kernel launch, which defines the number of thread groupings (called **blocks**), as well as how many **threads** to execute in each block. Execution configuration will be explored at great length later in the lab, but for the time being, notice the kernel is launching with `1` block of threads (the first execution configuration argument) which contains `1` thread (the second configuration argument).\n",
    "\n",
    "`cudaDeviceSynchronize();`\n",
    "  - Unlike much C/C++ code, launching kernels is **asynchronous**: the CPU code will continue to execute *without waiting for the kernel launch to complete*.\n",
    "  - A call to `cudaDeviceSynchronize`, a function provided by the CUDA runtime, will cause the host (CPU) code to wait until the device (GPU) code completes, and only then resume execution on the CPU."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### Exercise: Write a Hello GPU Kernel\n",
    "\n",
    "The [`01-hello-gpu.cu`](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/01-hello/01-hello-gpu.cu) (*<---- click on the link of the source file to open it in another tab for editing*) contains a program that is already working. It contains two functions, both with print \"Hello from the CPU\" messages. Your goal is to refactor the `helloGPU` function in the source file so that it actually runs on the GPU, and prints a message indicating that it does.\n",
    "\n",
    "- Refactor the application, before compiling and running it with the `nvcc` command just below (remember, you can execute the contents of the code execution cell by `CTRL + ENTER` it). The comments in [`01-hello-gpu.cu`](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/01-hello/01-hello-gpu.cu) will assist your work. If you get stuck, or want to check your work, refer to the [solution](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/01-hello/solutions/01-hello-gpu-solution.cu). Don't forget to save your changes to the file before compiling and running with the command below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!nvcc -arch=sm_70 -o hello-gpu 01-hello/01-hello-gpu.cu -run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After successfully refactoring [`01-hello-gpu.cu`](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/01-hello/01-hello-gpu.cu), make the following modifications, attempting to compile and run it after each change (by `CTRL + ENTER` clicking on the code execution cell above). When given errors, take the time to read them carefully: familiarity with them will serve you greatly when you begin writing your own accelerated code.\n",
    "\n",
    "- Remove the keyword `__global__` from your kernel definition. Take care to note the line number in the error: what do you think is meant in the error by \"configured\"? Replace `__global__` when finished.\n",
    "- Remove the execution configuration: does your understanding of \"configured\" still make sense? Replace the execution configuration when finished.\n",
    "- Remove the call to `cudaDeviceSynchronize`. Before compiling and running the code, take a guess at what will happen, recalling that kernels are launched asynchronously, and that `cudaDeviceSynchronize` is what makes host execution in wait for kernel execution to complete before proceeding. Replace the call to `cudaDeviceSynchronize` when finished.\n",
    "- Refactor `01-hello-gpu.cu` so that `Hello from the GPU` prints **before** `Hello from the CPU`.\n",
    "- Refactor `01-hello-gpu.cu` so that `Hello from the GPU` prints **twice**, once  **before** `Hello from the CPU`, and once **after**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### Compiling and Running Accelerated CUDA Code\n",
    "\n",
    "This section contains details about the `nvcc` command you issued above to compile and run your `.cu` program.\n",
    "\n",
    "The CUDA platform ships with the [**NVIDIA CUDA Compiler**](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html) `nvcc`, which can compile CUDA accelerated applications, both the host, and the device code they contain. For the purposes of this lab, `nvcc` discussion will be pragmatically scoped to suit our immediate needs. After completing the lab, For anyone interested in a deeper dive into `nvcc`, start with [the documentation](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html).\n",
    "\n",
    "`nvcc` will be very familiar to experienced `gcc` users. Compiling, for example, a `some-CUDA.cu` file, is simply:\n",
    "\n",
    "`nvcc -arch=sm_70 -o out some-CUDA.cu -run`\n",
    "  - `nvcc` is the command line command for using the `nvcc` compiler.\n",
    "  - `some-CUDA.cu` is passed as the file to compile.\n",
    "  - The `o` flag is used to specify the output file for the compiled program.\n",
    "  - The `arch` flag indicates for which **architecture** the files must be compiled. For the present case `sm_70` will serve to compile specifically for the Volta GPUs this lab is running on, but for those interested in a deeper dive, please refer to the docs about the [`arch` flag](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#options-for-steering-gpu-code-generation), [virtual architecture features](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list) and [GPU features](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list).\n",
    "  - As a matter of convenience, providing the `run` flag will execute the successfully compiled binary."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## CUDA Thread Hierarchy\n",
    "\n",
    "The following slides present upcoming material visually, at a high level. Click through the slides before moving on to more detailed coverage of their topics in following sections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<div align=\"center\"><iframe src=\"https://docs.google.com/presentation/d/e/2PACX-1vQYti_rVyWNNOccK6Slxd1VqazuqO5IhP17tmk-yTZAQfPEVpF14aZF9Vo3XkrDbFetNLTm_Pnk7JvD/embed?start=false&loop=false&delayms=3000\" frameborder=\"0\" width=\"900\" height=\"550\" allowfullscreen=\"true\" mozallowfullscreen=\"true\" webkitallowfullscreen=\"true\"></iframe></div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%HTML\n",
    "\n",
    "<div align=\"center\"><iframe src=\"https://docs.google.com/presentation/d/e/2PACX-1vQYti_rVyWNNOccK6Slxd1VqazuqO5IhP17tmk-yTZAQfPEVpF14aZF9Vo3XkrDbFetNLTm_Pnk7JvD/embed?start=false&loop=false&delayms=3000\" frameborder=\"0\" width=\"900\" height=\"550\" allowfullscreen=\"true\" mozallowfullscreen=\"true\" webkitallowfullscreen=\"true\"></iframe></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Launching Parallel Kernels\n",
    "\n",
    "The execution configuration allows programmers to specify details about launching the kernel to run in parallel on multiple GPU **threads**. More precisely, the execution configuration allows programmers to specifiy how many groups of threads - called **thread blocks**, or just **blocks** - and how many threads they would like each thread block to contain. The syntax for this is:\n",
    "\n",
    "`<<< NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK>>>`\n",
    "\n",
    "** The kernel code is executed by every thread in every thread block configured when the kernel is launched**.\n",
    "\n",
    "Thus, under the assumption that a kernel called `someKernel` has been defined, the following are true:\n",
    "  - `someKernel<<<1, 1>>()` is configured to run in a single thread block which has a single thread and will therefore run only once.\n",
    "  - `someKernel<<<1, 10>>()` is configured to run in a single thread block which has 10 threads and will therefore run 10 times.\n",
    "  - `someKernel<<<10, 1>>()` is configured to run in 10 thread blocks which each have a single thread and will therefore run 10 times.\n",
    "  - `someKernel<<<10, 10>>()` is configured to run in 10 thread blocks which each have 10 threads and will therefore run 100 times."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### Exercise: Launch Parallel Kernels\n",
    "\n",
    "[`01-first-parallel.cu`](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/02-first-parallel/01-basic-parallel.cu) currently makes a very basic function call that prints the message `This should be running in parallel.` Follow the steps below to refactor it first to run on the GPU, and then, in parallel, both in a single, and then, in multiple thread blocks. Refer to [the solution](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/02-first-parallel/solutions/01-basic-parallel-solution.cu) if you get stuck.\n",
    "\n",
    "- Refactor the `firstParallel` function to launch as a CUDA kernel on the GPU. You should still be able to see the output of the function after compiling and running `01-basic-parallel.cu` with the `nvcc` command just below.\n",
    "- Refactor the `firstParallel` kernel to execute in parallel on 5 threads, all executing in a single thread block. You should see the output message printed 5 times after compiling and running the code.\n",
    "- Refactor the `firstParallel` kernel again, this time to execute in parallel inside 5 thread blocks, each containing 5 threads. You should see the output message printed 25 times now after compiling and running."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!nvcc -arch=sm_70 -o basic-parallel 02-first-parallel/01-basic-parallel.cu -run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## CUDA-Provided Thread Hierarchy Variables\n",
    "\n",
    "The following slides present upcoming material visually, at a high level. Click through the slides before moving on to more detailed coverage of their topics in following sections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<div align=\"center\"><iframe src=\"https://docs.google.com/presentation/d/e/2PACX-1vSVS21bI-cje3Cqtxke-LHcvxk1ZxvZF-F35bgHSKfvNsvkGklCeqwlXHCDPJey5meZ1vTVYMqiF0UV/embed?start=false&loop=false&delayms=3000\" frameborder=\"0\" width=\"900\" height=\"550\" allowfullscreen=\"true\" mozallowfullscreen=\"true\" webkitallowfullscreen=\"true\"></iframe></div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%HTML\n",
    "\n",
    "<div align=\"center\"><iframe src=\"https://docs.google.com/presentation/d/e/2PACX-1vSVS21bI-cje3Cqtxke-LHcvxk1ZxvZF-F35bgHSKfvNsvkGklCeqwlXHCDPJey5meZ1vTVYMqiF0UV/embed?start=false&loop=false&delayms=3000\" frameborder=\"0\" width=\"900\" height=\"550\" allowfullscreen=\"true\" mozallowfullscreen=\"true\" webkitallowfullscreen=\"true\"></iframe></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Thread and Block Indices\n",
    "\n",
    "Each thread is given an index within its thread block, starting at `0`. Additionally, each block is given an index, starting at `0`. Just as threads are grouped into thread blocks, blocks are grouped into a **grid**, which is the highest entity in the CUDA thread hierarchy. In summary, CUDA kernels are executed in a grid of 1 or more blocks, with each block containing the same number of 1 or more threads.\n",
    "\n",
    "CUDA kernels have access to special variables identifying both the index of the thread (within the block) that is executing the kernel, and, the index of the block (within the grid) that the thread is within. These variables are `threadIdx.x` and `blockIdx.x` respectively."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### Exercise: Use Specific Thread and Block Indices\n",
    "\n",
    "Currently the [`01-thread-and-block-idx.cu`](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/03-indices/01-thread-and-block-idx.cu) file contains a working kernel that is printing a failure message. Open the file to learn how to update the execution configuration so that the success message will print. After refactoring, compile and run the code with the code execution cell below to confirm your work. Refer to [the solution](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/03-indices/solutions/01-thread-and-block-idx-solution.cu) if you get stuck."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!nvcc -arch=sm_70 -o thread-and-block-idx 03-indices/01-thread-and-block-idx.cu -run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Accelerating For Loops\n",
    "\n",
    "For loops in CPU-only applications are ripe for acceleration: rather than run each iteration of the loop serially, each iteration of the loop can be run in parallel in its own thread. Consider the following for loop, and notice, though it is obvious, that it controls how many times the loop will execute, as well as defining what will happen for each iteration of the loop:\n",
    "\n",
    "```cpp\n",
    "int N = 2<<20;\n",
    "for (int i = 0; i < N; ++i)\n",
    "{\n",
    "  printf(\"%d\\n\", i);\n",
    "}\n",
    "```\n",
    "\n",
    "In order to parallelize this loop, 2 steps must be taken:\n",
    "\n",
    "- A kernel must be written to do the work of a **single iteration of the loop**.\n",
    "- Because the kernel will be agnostic of other running kernels, the execution configuration must be such that the kernel executes the correct number of times, for example, the number of times the loop would have iterated."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### Exercise: Accelerating a For Loop with a Single Block of Threads\n",
    "\n",
    "Currently, the `loop` function inside [`01-single-block-loop.cu`](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/04-loops/01-single-block-loop.cu), runs a for loop that will serially print the numbers `0` through `9`. Refactor the `loop` function to be a CUDA kernel which will launch to execute `N` iterations in parallel. After successfully refactoring, the numbers `0` through `9` should still be printed. Refer to [the solution](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/04-loops/solutions/01-single-block-loop-solution.cu) if you get stuck."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!nvcc -arch=sm_70 -o single-block-loop 04-loops/01-single-block-loop.cu -run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Coordinating Parallel Threads\n",
    "\n",
    "The following slides present upcoming material visually, at a high level. Click through the slides before moving on to more detailed coverage of their topics in following sections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<div align=\"center\"><iframe src=\"https://docs.google.com/presentation/d/e/2PACX-1vSfi8LAinJ1RqTzlB2vRsAcDzCCk9gZov5rQODN5rtRMPt57UizCVv5LSVZ5WLxGtrsMm7FIkLb0wMR/embed?start=false&loop=false&delayms=3000\" frameborder=\"0\" width=\"900\" height=\"550\" allowfullscreen=\"true\" mozallowfullscreen=\"true\" webkitallowfullscreen=\"true\"></iframe></div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%HTML\n",
    "\n",
    "<div align=\"center\"><iframe src=\"https://docs.google.com/presentation/d/e/2PACX-1vSfi8LAinJ1RqTzlB2vRsAcDzCCk9gZov5rQODN5rtRMPt57UizCVv5LSVZ5WLxGtrsMm7FIkLb0wMR/embed?start=false&loop=false&delayms=3000\" frameborder=\"0\" width=\"900\" height=\"550\" allowfullscreen=\"true\" mozallowfullscreen=\"true\" webkitallowfullscreen=\"true\"></iframe></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Using Block Dimensions for More Parallelization\n",
    "\n",
    "There is a limit to the number of threads that can exist in a thread block: 1024 to be precise. In order to increase the amount of parallelism in accelerated applications, we must be able to coordinate among multiple thread blocks.\n",
    "\n",
    "CUDA Kernels have access to a special variable that gives the number of threads in a block: `blockDim.x`. Using this variable, in conjunction with `blockIdx.x` and `threadIdx.x`, increased parallelization can be accomplished by organizing parallel execution accross multiple blocks of multiple threads with the idiomatic expression `threadIdx.x + blockIdx.x * blockDim.x`. Here is a detailed example.\n",
    "\n",
    "The execution configuration `<<<10, 10>>>` would launch a grid with a total of 100 threads, contained in 10 blocks of 10 threads. We would therefore hope for each thread to have the ability to calculate some index unique to itself between `0` and `99`.\n",
    "\n",
    "- If block `blockIdx.x` equals `0`, then `blockIdx.x * blockDim.x` is `0`. Adding to `0` the possible `threadIdx.x` values `0` through `9`, then we can generate the indices `0` through `9` within the 100 thread grid.\n",
    "- If block `blockIdx.x` equals `1`, then `blockIdx.x * blockDim.x` is `10`. Adding to `10` the possible `threadIdx.x` values `0` through `9`, then we can generate the indices `10` through `19` within the 100 thread grid.\n",
    "- If block `blockIdx.x` equals `5`, then `blockIdx.x * blockDim.x` is `50`. Adding to `50` the possible `threadIdx.x` values `0` through `9`, then we can generate the indices `50` through `59` within the 100 thread grid.\n",
    "- If block `blockIdx.x` equals `9`, then `blockIdx.x * blockDim.x` is `90`. Adding to `90` the possible `threadIdx.x` values `0` through `9`, then we can generate the indices `90` through `99` within the 100 thread grid."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### Exercise: Accelerating a For Loop with Multiple Blocks of Threads\n",
    "\n",
    "Currently, the `loop` function inside [`02-multi-block-loop.cu`](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/04-loops/02-multi-block-loop.cu) runs a for loop that will serially print the numbers `0` through `9`. Refactor the `loop` function to be a CUDA kernel which will launch to execute `N` iterations in parallel. After successfully refactoring, the numbers `0` through `9` should still be printed. For this exercise, as an additional constraint, use an execution configuration that launches *at least 2 blocks of threads.* Refer to [the solution](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/04-loops/solutions/02-multi-block-loop-solution.cu) if you get stuck."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!nvcc -arch=sm_70 -o multi-block-loop 04-loops/02-multi-block-loop.cu -run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Allocating Memory to be accessed on the GPU and the CPU\n",
    "\n",
    "More recent versions of CUDA (version 6 and later) have made it easy to allocate memory that is available to both the CPU host and any number of GPU devices, and while there are many [intermediate and advanced techniques](http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#memory-optimizations) for memory management that will support the most optimal performance in accelerated applications, the most basic CUDA memory management technique we will now cover supports fantastic performance gains over CPU-only applications with almost no developer overhead.\n",
    "\n",
    "To allocate and free memory, and obtain a pointer that can be referenced in both host and device code, replace calls to `malloc` and `free` with `cudaMallocManaged` and `cudaFree` as in the following example:\n",
    "\n",
    "```cpp\n",
    "// CPU-only\n",
    "\n",
    "int N = 2<<20;\n",
    "size_t size = N * sizeof(int);\n",
    "\n",
    "int *a;\n",
    "a = (int *)malloc(size);\n",
    "\n",
    "// Use `a` in CPU-only program.\n",
    "\n",
    "free(a);\n",
    "```\n",
    "\n",
    "```cpp\n",
    "// Accelerated\n",
    "\n",
    "int N = 2<<20;\n",
    "size_t size = N * sizeof(int);\n",
    "\n",
    "int *a;\n",
    "// Note the address of `a` is passed as first argument.\n",
    "cudaMallocManaged(&a, size);\n",
    "\n",
    "// Use `a` on the CPU and/or on any GPU in the accelerated system.\n",
    "\n",
    "cudaFree(a);\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### Exercise: Array Manipulation on both the Host and Device\n",
    "\n",
    "The [`01-double-elements.cu`](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/05-allocate/01-double-elements.cu) program allocates an array, initializes it with integer values on the host, attempts to double each of these values in parallel on the GPU, and then confirms whether or not the doubling operations were successful, on the host. Currently the program will not work: it is attempting to interact on both the host and the device with an array at pointer `a`, but has only allocated the array (using `malloc`) to be accessible on the host. Refactor the application to meet the following conditions, referring to [the solution](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/05-allocate/solutions/01-double-elements-solution.cu) if you get stuck:\n",
    "\n",
    "- `a` should be available to both host and device code.\n",
    "- The memory at `a` should be correctly freed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!nvcc -arch=sm_70 -o double-elements 05-allocate/01-double-elements.cu -run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Grid Size Work Amount Mismatch\n",
    "\n",
    "The following slides present upcoming material visually, at a high level. Click through the slides before moving on to more detailed coverage of their topics in following sections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<div align=\"center\"><iframe src=\"https://docs.google.com/presentation/d/e/2PACX-1vTn2HX1FyUO94g5TxgBm0C7pu-_5UXPwYtMkhGLnqgs0-2Y1g8CE3YCuZuob25wrrXz0x8cT9_XxyBl/embed?start=false&loop=false&delayms=3000\" frameborder=\"0\" width=\"900\" height=\"550\" allowfullscreen=\"true\" mozallowfullscreen=\"true\" webkitallowfullscreen=\"true\"></iframe></div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%HTML\n",
    "\n",
    "<div align=\"center\"><iframe src=\"https://docs.google.com/presentation/d/e/2PACX-1vTn2HX1FyUO94g5TxgBm0C7pu-_5UXPwYtMkhGLnqgs0-2Y1g8CE3YCuZuob25wrrXz0x8cT9_XxyBl/embed?start=false&loop=false&delayms=3000\" frameborder=\"0\" width=\"900\" height=\"550\" allowfullscreen=\"true\" mozallowfullscreen=\"true\" webkitallowfullscreen=\"true\"></iframe></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Handling Block Configuration Mismatches to Number of Needed Threads\n",
    "\n",
    "It may be the case that an execution configuration cannot be expressed that will create the exact number of threads needed for parallelizing a loop.\n",
    "\n",
    "A common example has to do with the desire to choose optimal block sizes. For example, due to GPU hardware traits, blocks that contain a number of threads that are a multiple of 32 are often desirable for performance benefits. Assuming that we wanted to launch blocks each containing 256 threads (a multiple of 32), and needed to run 1000 parallel tasks (a trivially small number for ease of explanation), then there is no number of blocks that would produce an exact total of 1000 threads in the grid, since there is no integer value 32 can be multiplied by to equal exactly 1000.\n",
    "\n",
    "This scenario can be easily addressed in the following way:\n",
    "\n",
    "- Write an execution configuration that creates **more** threads than necessary to perform the allotted work.\n",
    "- Pass a value as an argument into the kernel (`N`) that represents to the total size of the data set to be processed, or the total threads that are needed to complete the work.\n",
    "- After calculating the thread's index within the grid (using `tid+bid*bdim`), check that this index does not exceed `N`, and only perform the pertinent work of the kernel if it does not.\n",
    "\n",
    "Here is an example of an idiomatic way to write an execution configuration when both `N` and the number of threads in a block are known, and an exact match between the number of threads in the grid and `N` cannot be guaranteed. It ensures that there are always at least as many threads as needed for `N`, and only 1 additional block's worth of threads extra, at most:\n",
    "\n",
    "```cpp\n",
    "// Assume `N` is known\n",
    "int N = 100000;\n",
    "\n",
    "// Assume we have a desire to set `threads_per_block` exactly to `256`\n",
    "size_t threads_per_block = 256;\n",
    "\n",
    "// Ensure there are at least `N` threads in the grid, but only 1 block's worth extra\n",
    "size_t number_of_blocks = (N + threads_per_block - 1) / threads_per_block;\n",
    "\n",
    "some_kernel<<<number_of_blocks, threads_per_block>>>(N);\n",
    "```\n",
    "\n",
    "Becuase the execution configuration above results in more threads in the grid than `N`, care will need to be taken inside of the `some_kernel` definition so that `some_kernel` does not attempt to access out of range data elements, when being executed by one of the \"extra\" threads:\n",
    "\n",
    "```cpp\n",
    "__global__ some_kernel(int N)\n",
    "{\n",
    "  int idx = threadIdx.x + blockIdx.x * blockDim.x;\n",
    "\n",
    "  if (idx < N) // Check to make sure `idx` maps to some value within `N`\n",
    "  {\n",
    "    // Only do work if it does\n",
    "  }\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### Exercise: Accelerating a For Loop with a Mismatched Execution Configuration\n",
    "\n",
    "The program in [`02-mismatched-config-loop.cu`](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/05-allocate/02-mismatched-config-loop.cu) allocates memory, using `cudaMallocManaged` for a 1000 element array of integers, and then seeks to initialize all the values of the array in parallel using a CUDA kernel. This program assumes that both `N` and the number of `threads_per_block` are known. Your task is to complete the following two objectives, refer to [the solution](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/05-allocate/solutions/02-mismatched-config-loop-solution.cu) if you get stuck:\n",
    "\n",
    "- Assign a value to `number_of_blocks` that will make sure there are at least as many threads as there are elements in `a` to work on.\n",
    "- Update the `initializeElementsTo` kernel to make sure that it does not attempt to work on data elements that are out of range."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!nvcc -arch=sm_70 -o mismatched-config-loop 05-allocate/02-mismatched-config-loop.cu -run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Grid-Stride Loops\n",
    "\n",
    "The following slides present upcoming material visually, at a high level. Click through the slides before moving on to more detailed coverage of their topics in following sections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<div align=\"center\"><iframe src=\"https://docs.google.com/presentation/d/e/2PACX-1vTSfcPagyv7ObRnhygFnKrvDIDa-wUuc3yR-qs7xd4gQxProMOqXzNqe8y9vz711cLIbPp1qYJc7R3l/embed?start=false&loop=false&delayms=3000\" frameborder=\"0\" width=\"900\" height=\"550\" allowfullscreen=\"true\" mozallowfullscreen=\"true\" webkitallowfullscreen=\"true\"></iframe></div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%HTML\n",
    "\n",
    "<div align=\"center\"><iframe src=\"https://docs.google.com/presentation/d/e/2PACX-1vTSfcPagyv7ObRnhygFnKrvDIDa-wUuc3yR-qs7xd4gQxProMOqXzNqe8y9vz711cLIbPp1qYJc7R3l/embed?start=false&loop=false&delayms=3000\" frameborder=\"0\" width=\"900\" height=\"550\" allowfullscreen=\"true\" mozallowfullscreen=\"true\" webkitallowfullscreen=\"true\"></iframe></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Data Sets Larger then the Grid\n",
    "\n",
    "Either by choice, often to create the most performant execution configuration, or out of necessity, the number of threads in a grid may be smaller than the size of a data set. Consider an array with 1000 elements, and a grid with 250 threads (using trivial sizes here for ease of explanation). Here, each thread in the grid will need to be used 4 times. One common method to do this is to use a **grid-stride loop** within the kernel.\n",
    "\n",
    "In a grid-stride loop, each thread will calculate its unique index within the grid using `tid+bid*bdim`, perform its operation on the element at that index within the array, and then, add to its index the number of threads in the grid and repeat, until it is out of range of the array. For example, for a 500 element array and a 250 thread grid, the thread with index 20 in the grid would:\n",
    "\n",
    "- Perform its operation on element 20 of the 500 element array\n",
    "- Increment its index by 250, the size of the grid, resulting in 270\n",
    "- Perform its operation on element 270 of the 500 element array\n",
    "- Increment its index by 250, the size of the grid, resulting in 520\n",
    "- Because 520 is now out of range for the array, the thread will stop its work\n",
    "\n",
    "CUDA provides a special variable giving the number of blocks in a grid, `gridDim.x`. Calculating the total number of threads in a grid then is simply the number of blocks in a grid multiplied by the number of threads in each block, `gridDim.x * blockDim.x`. With this in mind, here is a verbose example of a grid-stride loop within a kernel:\n",
    "\n",
    "```cpp\n",
    "__global void kernel(int *a, int N)\n",
    "{\n",
    "  int indexWithinTheGrid = threadIdx.x + blockIdx.x * blockDim.x;\n",
    "  int gridStride = gridDim.x * blockDim.x;\n",
    "\n",
    "  for (int i = indexWithinTheGrid; i < N; i += gridStride)\n",
    "  {\n",
    "    // do work on a[i];\n",
    "  }\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### Exercise: Use a Grid-Stride Loop to Manipulate an Array Larger than the Grid\n",
    "\n",
    "Refactor [`03-grid-stride-double.cu`](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/05-allocate/03-grid-stride-double.cu) to use a grid-stride loop in the `doubleElements` kernel, in order that the grid, which is smaller than `N`, can reuse threads to cover every element in the array. The program will print whether or not every element in the array has been doubled, currently the program accurately prints `FALSE`. Refer to [the solution](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/05-allocate/solutions/03-grid-stride-double-solution.cu) if you get stuck."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!nvcc -arch=sm_70 -o grid-stride-double 05-allocate/03-grid-stride-double.cu -run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Error Handling\n",
    "\n",
    "As in any application, error handling in accelerated CUDA code is essential. Many, if not most CUDA functions (see, for example, the [memory management functions](http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY)) return a value of type `cudaError_t`, which can be used to check whether or not an error occured while calling the function. Here is an example where error handling is performed for a call to `cudaMallocManaged`:\n",
    "\n",
    "```cpp\n",
    "cudaError_t err;\n",
    "err = cudaMallocManaged(&a, N)                    // Assume the existence of `a` and `N`.\n",
    "\n",
    "if (err != cudaSuccess)                           // `cudaSuccess` is provided by CUDA.\n",
    "{\n",
    "  printf(\"Error: %s\\n\", cudaGetErrorString(err)); // `cudaGetErrorString` is provided by CUDA.\n",
    "}\n",
    "```\n",
    "\n",
    "Launching kernels, which are defined to return `void`, do not return a value of type `cudaError_t`. To check for errors occuring at the time of a kernel launch, for example if the launch configuration is erroneous, CUDA provides the `cudaGetLastError` function, which does return a value of type `cudaError_t`.\n",
    "\n",
    "```cpp\n",
    "/*\n",
    " * This launch should cause an error, but the kernel itself\n",
    " * cannot return it.\n",
    " */\n",
    "\n",
    "someKernel<<<1, -1>>>();  // -1 is not a valid number of threads.\n",
    "\n",
    "cudaError_t err;\n",
    "err = cudaGetLastError(); // `cudaGetLastError` will return the error from above.\n",
    "if (err != cudaSuccess)\n",
    "{\n",
    "  printf(\"Error: %s\\n\", cudaGetErrorString(err));\n",
    "}\n",
    "```\n",
    "\n",
    "Finally, in order to catch errors that occur asynchronously, for example during the execution of an asynchronous kernel, it is essential to check the status returned by a subsequent synchronizing cuda runtime API call, such as `cudaDeviceSynchronize`, which will return an error if one of the kernels launched previously should fail."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### Exercise: Add Error Handling\n",
    "\n",
    "Currently [`01-add-error-handling.cu`](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/06-errors/01-add-error-handling.cu) compiles, runs, and prints that the elements of the array were not successfully doubled. The program does not, however, indicate that there are any errors within it. Refactor the application to handle CUDA errors so that you can learn what is wrong with the program and effectively debug it. You will need to investigate both synchronous errors potentially created when calling CUDA functions, as well as asynchronous errors potentially created while a CUDA kernel is executing. Refer to [the solution](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/06-errors/solutions/01-add-error-handling-solution.cu) if you get stuck."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!nvcc -arch=sm_70 -o add-error-handling 06-errors/01-add-error-handling.cu -run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### CUDA Error Handling Function\n",
    "\n",
    "It can be helpful to create a macro that wraps CUDA function calls for checking errors. Here is an example, feel free to use it in the remaining exercises:\n",
    "\n",
    "```cpp\n",
    "#include <stdio.h>\n",
    "#include <assert.h>\n",
    "\n",
    "inline cudaError_t checkCuda(cudaError_t result)\n",
    "{\n",
    "  if (result != cudaSuccess) {\n",
    "    fprintf(stderr, \"CUDA Runtime Error: %s\\n\", cudaGetErrorString(result));\n",
    "    assert(result == cudaSuccess);\n",
    "  }\n",
    "  return result;\n",
    "}\n",
    "\n",
    "int main()\n",
    "{\n",
    "\n",
    "/*\n",
    " * The macro can be wrapped around any function returning\n",
    " * a value of type `cudaError_t`.\n",
    " */\n",
    "\n",
    "  checkCuda( cudaDeviceSynchronize() )\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Summary\n",
    "\n",
    "At this point in time you have accomplished all of the following lab objectives:\n",
    "\n",
    "- Write, compile, and run C/C++ programs that both call CPU functions and **launch** GPU **kernels**.\n",
    "- Control parallel **thread hierarchy** using **execution configuration**.\n",
    "- Refactor serial loops to execute their iterations in parallel on a GPU.\n",
    "- Allocate and free memory available to both CPUs and GPUs.\n",
    "- Handle errors generated by CUDA code.\n",
    "\n",
    "Now you will complete the final objective of the lab:\n",
    "\n",
    "- Accelerate CPU-only applications."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### Final Exercise: Accelerate Vector Addition Application\n",
    "\n",
    "The following challenge will give you an opportunity to use everything that you have learned thus far in the lab. It involves accelerating a CPU-only vector addition program, which, while not the most sophisticated program, will give you an opportunity to focus on what you have learned about GPU-accelerating an application with CUDA. After completing this exercise, if you have time and interest, continue on to the *Advanced Content* section for some challenges that involve more complex codebases.\n",
    "\n",
    "[`01-vector-add.cu`](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/07-vector-add/01-vector-add.cu) contains a functioning CPU-only vector addition application. Accelerate its `addVectorsInto` function to run as a CUDA kernel on the GPU and to do its work in parallel. Consider the following that need to occur, and refer to [the solution](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/07-vector-add/solutions/01-vector-add-solution.cu) if you get stuck.\n",
    "\n",
    "- Augment the `addVectorsInto` definition so that it is a CUDA kernel.\n",
    "- Choose and utilize a working execution configuration so that `addVectorsInto` launches as a CUDA kernel.\n",
    "- Update memory allocations, and memory freeing to reflect that the 3 vectors `a`, `b`, and `result` need to be accessed by host and device code.\n",
    "- Refactor the body of `addVectorsInto`: it will be launched inside of a single thread, and only needs to do one thread's worth of work on the input vectors. Be certain the thread will never try to access elements outside the range of the input vectors, and take care to note whether or not the thread needs to do work on more than one element of the input vectors.\n",
    "- Add error handling in locations where CUDA code might otherwise silently fail."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!nvcc -arch=sm_70 -o vector-add 07-vector-add/01-vector-add.cu -run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Advanced Content\n",
    "\n",
    "The following exercises provide additional challenge for those with time and interest. They require the use of more advanced techniques, and provide less scaffolding. They are difficult and excellent for your development."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Grids and Blocks of 2 and 3 Dimensions\n",
    "\n",
    "Grids and blocks can be defined to have up to 3 dimensions. Defining them with multiple dimensions does not impact their performance in any way, but can be very helpful when dealing with data that has multiple dimensions, for example, 2d matrices. To define either grids or blocks with two or 3 dimensions, use CUDA's `dim3` type as such:\n",
    "\n",
    "```cpp\n",
    "dim3 threads_per_block(16, 16, 1);\n",
    "dim3 number_of_blocks(16, 16, 1);\n",
    "someKernel<<<number_of_blocks, threads_per_block>>>();\n",
    "```\n",
    "\n",
    "Given the example just above, the variables `gridDim.x`, `gridDim.y`, `blockDim.x`, and `blockDim.y` inside of `someKernel`, would all be equal to `16`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### Exercise: Accelerate 2D Matrix Multiply Application\n",
    "\n",
    "The file [`01-matrix-multiply-2d.cu`](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/08-matrix-multiply/01-matrix-multiply-2d.cu) contains a host function `matrixMulCPU` which is fully functional. Your task is to build out the `matrixMulGPU` CUDA kernel. The source code will execute the matrix multiplication with both functions, and compare their answers to verify the correctness of the CUDA kernel you will be writing. Use the following guidelines to support your work and refer to [the solution](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/08-matrix-multiply/solutions/01-matrix-multiply-2d-solution.cu) if you get stuck:\n",
    "\n",
    "- You will need to create an execution configuration whose arguments are both `dim3` values with the `x` and `y` dimensions set to greater than `1`.\n",
    "- Inside the body of the kernel, you will need to establish the running thread's unique index within the grid per usual, but you should establish two indices for the thread: one for the x axis of the grid, and one for the y axis of the grid."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "FOUND ERROR at c[1][0]\r\n"
     ]
    }
   ],
   "source": [
    "!nvcc -arch=sm_70 -o matrix-multiply-2d 08-matrix-multiply/01-matrix-multiply-2d.cu -run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### Exercise: Accelerate A Thermal Conductivity Application\n",
    "\n",
    "In the following exercise, you will be accelerating an application that simulates the thermal conduction of silver in 2 dimensional space.\n",
    "\n",
    "Convert the `step_kernel_mod` function inside [`01-heat-conduction.cu`](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/09-heat/01-heat-conduction.cu) to execute on the GPU, and modify the `main` function to properly allocate data for use on CPU and GPU. The `step_kernel_ref` function executes on the CPU and is used for error checking. Because this code involves floating point calculations, different processors, or even simply reording operations on the same processor, can result in slightly different results. For this reason the error checking code uses an error threshold, instead of looking for an exact match. Refer to [the solution](../../../../../edit/tasks/task1/task/01_AC_CUDA_C/09-heat/solutions/01-heat-conduction-solution.cu) if you get stuck."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!nvcc -arch=sm_70 -o heat-conduction 09-heat/01-heat-conduction.cu -run"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Credit for the original Heat Conduction CPU source code in this task is given to the article [An OpenACC Example Code for a C-based heat conduction code](http://docplayer.net/30411068-An-openacc-example-code-for-a-c-based-heat-conduction-code.html) from the University of Houston."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
